learner policy
- North America > United States (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Zero-Shot Coordination in Ad Hoc Teams with Generalized Policy Improvement and Difference Rewards
Nigam, Rupal, Parikh, Niket, Osooli, Hamid, Yuasa, Mikihisa, Heglund, Jacob, Tran, Huy T.
Abstract--Real-world multi-agent systems may require ad hoc teaming, where an agent must coordinate with other previously unseen teammates to solve a task in a zero-shot manner. Prior work often either selects a pretrained policy based on an inferred model of the new teammates or pretrains a single policy that is robust to potential teammates. Instead, we propose to leverage all pretrained policies in a zero-shot transfer setting. We formalize this problem as an ad hoc multi-agent Markov decision process and present a solution that uses two key ideas, generalized policy improvement and difference rewards, for efficient and effective knowledge transfer between different teams. We empirically demonstrate that our algorithm, Generalized Policy improvement for Ad hoc Teaming (GPAT), successfully enables zero-shot transfer to new teams in three simulated environments: cooperative foraging, predator-prey, and Overcooked. We also demonstrate our algorithm in a real-world multi-robot setting. Ad hoc teaming (AHT) is an open challenge for multi-agent systems, in which an autonomous agent must successfully coordinate with other unknown agents [1]. Consider a search-and-rescue mission where robots are deployed from different organizations and expected to cooperate with each other on the fly--these robots may have different biases in how they achieve a given objective (e.g., risky vs. risk-averse search) or have different capabilities (e.g., sensing vs. manipulation). Adapting to such differences would enable agents to effectively and autonomously complete tasks where the team is unknown prior to deployment.
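A minimal sketch of the generalized policy improvement step described above, assuming one pretrained Q-function per source team: the ad hoc agent acts greedily with respect to the pointwise maximum over those Q-functions. Function and variable names are illustrative, not taken from the paper, and the difference-rewards component is omitted.

```python
import numpy as np

def gpi_action(state_features: np.ndarray, q_functions: list) -> int:
    """Generalized policy improvement: act greedily w.r.t. the
    pointwise maximum over Q-functions pretrained with source teams."""
    # Each q(state_features) returns a vector of action values.
    q_values = np.stack([q(state_features) for q in q_functions])  # (n_teams, n_actions)
    return int(q_values.max(axis=0).argmax())

# Toy usage with two hypothetical pretrained Q-functions over 4 actions.
q_team_a = lambda s: np.array([0.1, 0.7, 0.2, 0.0])
q_team_b = lambda s: np.array([0.5, 0.3, 0.9, 0.1])
print(gpi_action(np.zeros(3), [q_team_a, q_team_b]))  # -> 2
```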
- North America > United States > Illinois > Champaign County > Urbana (0.05)
- North America > United States > Illinois > Champaign County > Champaign (0.05)
- Asia > Japan (0.04)
- Government > Regional Government > North America Government > United States Government (0.46)
- Education (0.46)
Multi-Agent Guided Policy Optimization
Li, Yueheng, Xie, Guangming, Lu, Zongqing
Due to practical constraints such as partial observability and limited communication, Centralized Training with Decentralized Execution (CTDE) has become the dominant paradigm in cooperative Multi-Agent Reinforcement Learning (MARL). However, existing CTDE methods often underutilize centralized training or lack theoretical guarantees. We propose Multi-Agent Guided Policy Optimization (MAGPO), a novel framework that better leverages centralized training by integrating centralized guidance with decentralized execution. MAGPO uses an auto-regressive joint policy for scalable, coordinated exploration and explicitly aligns it with decentralized policies to ensure deployability under partial observability. We provide theoretical guarantees of monotonic policy improvement and empirically evaluate MAGPO on 43 tasks across 6 diverse environments. Results show that MAGPO consistently outperforms strong CTDE baselines and matches or surpasses fully centralized approaches, offering a principled and practical solution for decentralized multi-agent learning. Our code and experimental data can be found at https://github.com/liyheng/MAGPO.
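MAGPO's precise objective is given in the paper; the sketch below only illustrates, under assumed tensor shapes and an assumed KL alignment term, the general pattern the abstract describes: a centralized auto-regressive guide policy whose action distribution the decentralized, deployable policies are pulled toward during training.

```python
import torch
import torch.nn.functional as F

# Illustrative tensors: logits from a centralized auto-regressive guide policy
# (conditioned on global state and previously selected agents' actions) and from
# decentralized policies (conditioned on local observations only).
n_agents, n_actions = 3, 5
guide_logits = torch.randn(n_agents, n_actions)                   # assumed guide outputs
decentralized_logits = torch.randn(n_agents, n_actions, requires_grad=True)

# Alignment term (an assumption of this sketch): KL(guide || decentralized),
# pushing each deployable local policy toward the centrally guided one.
guide_probs = F.softmax(guide_logits, dim=-1)
align_loss = F.kl_div(
    F.log_softmax(decentralized_logits, dim=-1), guide_probs,
    reduction="batchmean",
)
align_loss.backward()  # would be combined with the usual policy-gradient loss
```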
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria (0.04)
- South America > Brazil > São Paulo (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Leader-follower formation enabled by pressure sensing in free-swimming undulatory robotic fish
Panta, Kundan, Deng, Hankun, DeLattre, Micah, Cheng, Bo
Fish use their lateral lines to sense flows and pressure gradients, enabling them to detect nearby objects and organisms. Towards replicating this capability, we demonstrated successful leader-follower formation swimming using flow pressure sensing in our undulatory robotic fish ($\mu$Bot/MUBot). The follower $\mu$Bot is equipped at its head with bilateral pressure sensors to detect signals excited by both its own and the leader's movements. First, using experiments with static formations between an undulating leader and a stationary follower, we determined the formation that resulted in strong pressure variations measured by the follower. This formation was then selected as the desired formation in free swimming for obtaining an expert policy. Next, a long short-term memory neural network was used as the control policy that maps the pressure signals along with the robot motor commands and the Euler angles (measured by the onboard IMU) to the steering command. The policy was trained to imitate the expert policy using behavior cloning and Dataset Aggregation (DAgger). The results show that with merely two bilateral pressure sensors and less than one hour of training data, the follower effectively tracked the leader within distances of up to 200 mm (= 1 body length) while swimming at speeds of 155 mm/s (= 0.8 body lengths/s). This work highlights the potential of fish-inspired robots to effectively navigate fluid environments and achieve formation swimming through the use of flow pressure feedback.
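A condensed sketch of the control pipeline the abstract outlines: an LSTM that maps a short history of bilateral pressure readings, the motor command, and IMU Euler angles to a steering command, wrapped in a DAgger-style loop that aggregates expert-relabeled rollouts. The input layout, dimensions, and placeholder rollout/expert data are assumptions of this sketch, not values from the paper.

```python
import torch
import torch.nn as nn

class SteeringPolicy(nn.Module):
    """LSTM policy mapping sensor history to a steering command.
    Input layout (an assumption of this sketch): [p_left, p_right,
    motor_command, roll, pitch, yaw] per time step."""
    def __init__(self, input_dim: int = 6, hidden_dim: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # (batch, time, hidden)
        return self.head(out[:, -1])   # steering command at the last step

# DAgger-style outer loop (schematic): roll out the current policy, relabel
# visited states with the expert's steering command, aggregate, and retrain.
policy = SteeringPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
dataset_x, dataset_y = [], []
for _ in range(3):                        # a few DAgger iterations
    traj = torch.randn(1, 50, 6)          # placeholder rollout of sensor data
    expert_label = torch.zeros(1, 1)      # placeholder expert command
    dataset_x.append(traj); dataset_y.append(expert_label)
    for _ in range(10):                   # behavior-cloning updates on the aggregate
        loss = nn.functional.mse_loss(policy(torch.cat(dataset_x)), torch.cat(dataset_y))
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```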
Do No Harm: A Counterfactual Approach to Safe Reinforcement Learning
Vaskov, Sean, Schwarting, Wilko, Baker, Chris L.
Reinforcement Learning (RL) for control has become increasingly popular due to its ability to learn rich feedback policies that take into account uncertainty and complex representations of the environment. When considering safety constraints, constrained optimization approaches, where agents are penalized for constraint violations, are commonly used. In such methods, if agents are initialized in, or must visit, states where constraint violation might be inevitable, it is unclear how much they should be penalized. We address this challenge by formulating a constraint on the counterfactual harm of the learned policy compared to a default, safe policy. In a philosophical sense this formulation only penalizes the learner for constraint violations that it caused; in a practical sense it maintains feasibility of the optimal control problem. We present simulation studies on a rover with uncertain road friction and a tractor-trailer parking environment that demonstrate our constraint formulation enables agents to learn safer policies than contemporary constrained RL methods.
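The exact constraint formulation is in the paper; as a rough reading of the abstract, the penalty below charges the learner only for the portion of constraint cost exceeding what a default safe policy would have incurred in the same scenario, so unavoidable violations are not penalized. The function name and the Lagrangian-style multiplier are assumptions of this sketch.

```python
def counterfactual_harm_penalty(learner_cost: float, default_cost: float,
                                lagrange_multiplier: float) -> float:
    """Penalize only the violation the learner caused: the positive excess of
    its constraint cost over the default safe policy's cost on the same
    scenario. (A schematic reading of the abstract, not the paper's exact
    formulation.)"""
    harm = max(0.0, learner_cost - default_cost)
    return lagrange_multiplier * harm

# Example: the learner violates the constraint (cost 1.0), but the default
# safe policy would also have incurred cost 1.0 here, so no penalty is added.
print(counterfactual_harm_penalty(1.0, 1.0, lagrange_multiplier=5.0))  # 0.0
print(counterfactual_harm_penalty(1.0, 0.0, lagrange_multiplier=5.0))  # 5.0
```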
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.86)
A Novel Variational Lower Bound for Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories, to understand the task for imitation or collaboration, thereby removing the need for manual reward engineering. However, IRL in the context of large, high-dimensional problems with unknown dynamics has been particularly challenging. In this paper, we present a new Variational Lower Bound for IRL (VLB-IRL), which is derived under the framework of a probabilistic graphical model with an optimality node. Our method simultaneously learns the reward function and policy under the learned reward function by maximizing the lower bound, which is equivalent to minimizing the reverse Kullback-Leibler divergence between an approximated distribution of optimality given the reward function and the true distribution of optimality given trajectories. This leads to a new IRL method that learns a valid reward function such that the policy under the learned reward achieves expert-level performance on several known domains. Importantly, the method outperforms the existing state-of-the-art IRL algorithms on these domains by achieving better returns from the learned policy. Reinforcement learning (RL) is a popular method for automating decision making and control. However, to achieve practical effectiveness, significant engineering of reward features and reward functions has traditionally been necessary.
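The sketch below is not the paper's variational lower bound; it only illustrates the alternating structure the abstract describes, with a placeholder surrogate in which sigmoid(r) stands in for the probability of the optimality variable. The network sizes, the 8-dimensional state-action input, and the loss itself are assumptions.

```python
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
reward_opt = torch.optim.Adam(reward_net.parameters(), lr=3e-4)

def reward_update(expert_sa: torch.Tensor, policy_sa: torch.Tensor) -> float:
    """Schematic reward step: treat sigmoid(r) as the probability of the
    optimality variable, pushing it up on expert transitions and down on
    policy transitions. A placeholder surrogate, not the paper's exact bound."""
    p_expert = torch.sigmoid(reward_net(expert_sa))
    p_policy = torch.sigmoid(reward_net(policy_sa))
    loss = -(torch.log(p_expert + 1e-8).mean() + torch.log(1 - p_policy + 1e-8).mean())
    reward_opt.zero_grad(); loss.backward(); reward_opt.step()
    return float(loss)

# The policy would then be improved with any RL algorithm under reward_net,
# and the two steps alternate until the policy reaches expert-level returns.
print(reward_update(torch.randn(32, 8), torch.randn(32, 8)))
```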
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Blending Imitation and Reinforcement Learning for Robust Policy Improvement
Liu, Xuefeng, Yoneda, Takuma, Stevens, Rick L., Walter, Matthew R., Chen, Yuxin
While reinforcement learning (RL) has shown promising performance, its sample complexity continues to be a substantial hurdle, restricting its broader application across a variety of domains. Imitation learning (IL) utilizes oracles to improve sample efficiency, yet it is often constrained by the quality of the oracles deployed. Our approach, Robust Policy Improvement (RPI), draws on the strengths of IL, using oracle queries to facilitate exploration--an aspect that is notably challenging in sparse-reward RL--particularly during the early stages of learning. As learning unfolds, RPI gradually transitions to RL, effectively treating the learned policy as an improved oracle. This algorithm is capable of learning from and improving upon a diverse set of black-box oracles. Integral to RPI are Robust Active Policy Selection (RAPS) and Robust Policy Gradient (RPG), both of which reason over whether to perform state-wise imitation from the oracles or to learn from the learner's own value function when the learner's performance surpasses that of the oracles in a specific state. Reinforcement learning (RL) has shown significant advancements, surpassing human capabilities in diverse domains such as Go (Silver et al., 2017), video games (Berner et al., 2019; Mnih et al., 2013), and Poker (Zhao et al., 2022). Despite such achievements, the application of RL is largely constrained by its substantial computational and data requirements and high sample complexity, particularly in fields like robotics (Singh et al., 2022) and healthcare (Han et al., 2023), where the extensive online interaction for trial and error is often impractical. Imitation learning (IL) (Osa et al., 2018) improves sample efficiency by allowing the agent to replace some or all environment interactions with demonstrations provided by an oracle policy.
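A toy sketch of the per-state gate the abstract attributes to RAPS/RPG, under the simplifying assumption that the decision reduces to comparing the learner's value estimate against the best oracle's: imitate where an oracle still looks better, otherwise learn from the learner's own value function. Names and the scalar comparison are illustrative, not the paper's actual criteria.

```python
import numpy as np

def choose_learning_mode(oracle_values: np.ndarray, learner_value: float):
    """Per-state gate (schematic reading of the abstract): imitate the best
    oracle where some oracle still looks better than the learner, otherwise
    fall back to ordinary RL on the learner's own value function."""
    best_oracle = int(np.argmax(oracle_values))
    if oracle_values[best_oracle] > learner_value:
        return ("imitate", best_oracle)   # state-wise imitation target
    return ("reinforce", None)            # learner has surpassed the oracles here

print(choose_learning_mode(np.array([0.4, 0.9, 0.6]), learner_value=0.7))  # imitate oracle 1
print(choose_learning_mode(np.array([0.4, 0.5, 0.6]), learner_value=0.7))  # reinforce
```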
- Energy (0.68)
- Government > Regional Government (0.46)
- Leisure & Entertainment > Games (0.34)